Biomedical Named Entity Recognition via Dictionary-based Synonym Generalization
Biomedical named entity recognition is one of the core tasks in biomedical
natural language processing (BioNLP). To tackle this task, numerous
supervised/distantly supervised approaches have been proposed. Despite their
remarkable success, these approaches inescapably demand laborious human effort.
To alleviate the need for human effort, dictionary-based approaches have been
proposed to extract named entities based solely on a given dictionary. However,
one downside of existing dictionary-based approaches is that they struggle to
identify concept synonyms that are not listed in the given dictionary, a
limitation we refer to as the synonym generalization problem. In this
study, we propose a novel Synonym Generalization (SynGen) framework that
recognizes the biomedical concepts contained in the input text using span-based
predictions. In particular, SynGen introduces two regularization terms, namely,
(1) a synonym distance regularizer; and (2) a noise perturbation regularizer,
to minimize the synonym generalization error. To demonstrate the effectiveness
of our approach, we provide a theoretical analysis of the bound on the synonym
generalization error. We extensively evaluate our approach on a wide range of
benchmarks and the results verify that SynGen outperforms previous
dictionary-based models by notable margins. Lastly, we provide a detailed
analysis to further reveal the merits and inner workings of our approach.
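As a rough illustration of how span-based predictions and the two regularizers could be combined in one training objective, here is a minimal PyTorch-style sketch; the function and tensor names (`span_logits`, `synonym_emb`, the `lambda_*` weights) are hypothetical and not taken from the SynGen implementation:

```python
import torch
import torch.nn.functional as F

def syngen_style_loss(span_logits, span_labels,
                      entity_emb, synonym_emb,
                      clean_emb, noisy_emb,
                      lambda_syn=0.1, lambda_noise=0.1):
    """Illustrative composite objective: span classification plus
    (1) a synonym-distance term pulling dictionary entries and their
        synonyms together in embedding space, and
    (2) a noise-perturbation term keeping span representations stable
        under small input perturbations."""
    # Standard classification loss over candidate spans.
    cls_loss = F.cross_entropy(span_logits, span_labels)
    # (1) Synonym distance regularizer.
    syn_reg = ((entity_emb - synonym_emb) ** 2).sum(dim=-1).mean()
    # (2) Noise perturbation regularizer.
    noise_reg = ((clean_emb - noisy_emb) ** 2).sum(dim=-1).mean()
    return cls_loss + lambda_syn * syn_reg + lambda_noise * noise_reg
```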
A Theoretical Analysis of the Repetition Problem in Text Generation
Text generation tasks, including translation, summarization, and language
modeling, have seen rapid growth in recent years. Despite these remarkable
achievements, the repetition problem has been observed in nearly all text
generation models, substantially undermining generation quality. To solve
the repetition problem, many methods have been proposed, but there is no
existing theoretical analysis that explains why this problem happens or how it
can be resolved. In this paper, we propose a new framework for the theoretical
analysis of the repetition problem. We first define the Average Repetition Probability
(ARP) to characterize the repetition problem quantitatively. Then, we conduct
an extensive analysis of the Markov generation model and derive several upper
bounds of the average repetition probability, together with intuitive interpretations. We
show that most of the existing methods are essentially minimizing the upper
bounds explicitly or implicitly. Grounded on our theory, we show that the
repetition problem is, unfortunately, caused by the traits of our language
itself. One major reason is that too many words predict the same word as their
subsequent word with high probability. Consequently, it is easy for generation
to return to that word and form repetitions, an issue we dub the high inflow
problem. Furthermore, we derive a concentration bound
of the average repetition probability for a general generation model. Finally,
based on the theoretical upper bounds, we propose a novel rebalanced encoding
approach to alleviate the high inflow problem. The experimental results show
that our theoretical framework is applicable in general generation models and
our proposed rebalanced encoding approach alleviates the repetition problem
significantly. The source code of this paper can be obtained from
https://github.com/fuzihaofzh/repetition-problem-nlg.
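As a rough, self-contained illustration of repetition in a Markov generation model, the sketch below fits a bigram model and measures a simple empirical repetition rate; this proxy is hypothetical and is not the paper's Average Repetition Probability (ARP) definition:

```python
import random
from collections import defaultdict

def build_bigram_model(tokens):
    """Fit a simple Markov (bigram) generation model from a token list."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, cur in zip(tokens, tokens[1:]):
        counts[prev][cur] += 1
    model = {}
    for word, nxt in counts.items():
        total = sum(nxt.values())
        model[word] = {v: c / total for v, c in nxt.items()}
    return model

def repetition_rate(model, start, steps=200, window=4, trials=50):
    """Empirical proxy for repetition: the fraction of generated tokens
    that already occurred within the previous `window` positions."""
    repeated, total = 0, 0
    for _ in range(trials):
        seq = [start]
        for _ in range(steps):
            nxt = model.get(seq[-1])
            if not nxt:
                break
            words, probs = zip(*nxt.items())
            tok = random.choices(words, weights=probs)[0]
            repeated += tok in seq[-window:]
            total += 1
            seq.append(tok)
    return repeated / max(total, 1)

# Usage: rate = repetition_rate(build_bigram_model(corpus_tokens), corpus_tokens[0])
```

A word with many high-probability predecessors (the "high inflow" case) drives this rate up, because many sampled paths funnel back into it.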
FISEdit: Accelerating Text-to-image Editing via Cache-enabled Sparse Diffusion Inference
Due to the recent success of diffusion models, text-to-image generation is
becoming increasingly popular and achieves a wide range of applications. Among
them, text-to-image editing, or continuous text-to-image generation, has
attracted considerable attention and can potentially improve the quality of generated images.
Users commonly want to slightly edit a generated image by making minor
modifications to the input textual description over several
rounds of diffusion inference. However, such an image editing process suffers
from the low inference efficiency of many existing diffusion models even using
GPU accelerators. To solve this problem, we introduce Fast Image Semantically
Edit (FISEdit), a cache-enabled sparse diffusion model inference engine for
efficient text-to-image editing. The key intuition behind our approach is to
utilize the semantic mapping between the minor modifications on the input text
and the affected regions on the output image. For each text editing step,
FISEdit can automatically identify the affected image regions and utilize the
cached feature maps of the unchanged regions to accelerate the inference process.
Extensive empirical results show that FISEdit is substantially faster than
existing methods on NVIDIA TITAN RTX and A100 GPUs, and even generates more
satisfactory images.
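A minimal sketch of the cache-and-mask intuition, assuming feature maps from the previous and current conditioning passes are available; the helper name and threshold are illustrative and not FISEdit's actual engine:

```python
import torch

def masked_feature_reuse(cached_feats, new_feats, threshold=0.1):
    """Sketch of the cache-reuse idea: keep cached activations where a
    minor text edit barely changes them and recompute only the affected
    regions. Both inputs are (C, H, W) feature maps; `threshold` is an
    illustrative sensitivity knob, not FISEdit's actual criterion."""
    # Per-location change magnitude, averaged over channels.
    delta = (new_feats - cached_feats).abs().mean(dim=0)       # (H, W)
    # Regions judged "affected" by the text edit.
    affected = (delta > threshold).float()                      # (H, W)
    # Fresh values inside the mask, cached values elsewhere.
    merged = affected * new_feats + (1.0 - affected) * cached_feats
    return merged, affected
```

In this simplified view, only the masked regions would need full denoising compute at each editing round, which is where the speedup comes from.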
Grand canonical Monte Carlo simulation on adsorption of aniline on the ice surface
Aniline has been found to have frequent environmental occurrence and high toxicity. However, little research has examined its environmental fate. Here, we employed Grand Canonical Monte Carlo (GCMC) simulations to investigate the adsorption behavior of aniline on the hexagonal ice surface at 200 K, using our modified force field for aniline and the TIP5P force field for water. The results indicate that the adsorption isotherm of aniline exhibits a “monolayer saturation plateau”, starting with a rapid increase, then a plateau, and finally a condensed phase. At very low surface coverage, the adsorption isotherm apparently follows a Langmuir-type adsorption isotherm, although anilines can be adsorbed to various sites. Within the range of the apparent Langmuir-type adsorption isotherm, adsorbed anilines are independent of each other, and most anilines lie almost parallel to the ice surface and form two N−H•••O hydrogen bonds. As the coverage increases, the adsorbed anilines can interact with each other, resulting in deviation from the apparent Langmuir-type adsorption isotherm. In addition, the adsorption energy from the GCMC simulation (−65.91 kJ mol−1) is in good agreement with that from our validating quantum chemistry calculation (−69.34 kJ mol−1), further confirming the reliability of our GCMC simulation results.
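For reference, the Langmuir-type behavior mentioned above can be illustrated with a small fit of the Langmuir isotherm n(p) = n_max·K·p / (1 + K·p); the data below are synthetic points generated from the model itself plus noise, not the paper's GCMC results:

```python
import numpy as np
from scipy.optimize import curve_fit

def langmuir(p, n_max, K):
    """Langmuir isotherm: coverage n(p) = n_max * K * p / (1 + K * p),
    with n_max the monolayer saturation capacity and K the equilibrium
    constant."""
    return n_max * K * p / (1.0 + K * p)

# Synthetic coverage-vs-pressure points, standing in for the kind of
# adsorption isotherm a GCMC run at fixed temperature would produce.
rng = np.random.default_rng(0)
p = np.linspace(0.01, 2.0, 25)
n = langmuir(p, 20.0, 3.0) + rng.normal(0.0, 0.3, p.size)

(n_max_fit, K_fit), _ = curve_fit(langmuir, p, n, p0=[10.0, 1.0])
print(f"fitted monolayer capacity ~ {n_max_fit:.1f}, K ~ {K_fit:.2f}")
```

Deviation of the measured points from such a fit at higher coverage is exactly the adsorbate-adsorbate interaction effect described in the abstract.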
BAND: Biomedical Alert News Dataset
Infectious disease outbreaks continue to pose a significant threat to human
health and well-being. To improve disease surveillance and understanding of
disease spread, several surveillance systems have been developed to monitor
daily news alerts and social media. However, existing systems lack thorough
epidemiological analysis in relation to corresponding alerts or news, largely
due to the scarcity of well-annotated report data. To address this gap, we
introduce the Biomedical Alert News Dataset (BAND), which includes 1,508
samples from existing reported news articles, open emails, and alerts, as well
as 30 epidemiology-related questions. Answering these questions requires the
model's expert reasoning abilities, thereby offering valuable insights into the
outbreak of the disease. The BAND dataset brings new challenges to the NLP
world, requiring better disguise capability of the content and the ability to
infer important information. We provide several benchmark tasks, including
Named Entity Recognition (NER), Question Answering (QA), and Event Extraction
(EE), to assess how well existing models handle these tasks in the
epidemiology domain. To the best of our knowledge, the BAND corpus is the
largest corpus of well-annotated biomedical outbreak alert news with
elaborately designed questions, making it a valuable resource for
epidemiologists and NLP researchers alike.
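A hypothetical sketch of what one annotated BAND-style sample covering the three benchmark tasks might look like; the field names are illustrative and not the released dataset schema:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class OutbreakNewsSample:
    """Hypothetical shape of one annotated alert-news sample covering the
    three benchmark tasks (NER, QA, EE)."""
    text: str
    # NER: (start, end, label) character spans, e.g. diseases and locations.
    entities: List[Tuple[int, int, str]] = field(default_factory=list)
    # QA: answers to the epidemiology-related questions, keyed by question.
    qa: Dict[str, str] = field(default_factory=dict)
    # EE: outbreak events with typed arguments (pathogen, place, time, ...).
    events: List[Dict[str, str]] = field(default_factory=list)
```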
Decoder-Only or Encoder-Decoder? Interpreting Language Model as a Regularized Encoder-Decoder
The sequence-to-sequence (seq2seq) task aims at generating the target
sequence based on the given input source sequence. Traditionally, most seq2seq
tasks are addressed with the Encoder-Decoder framework, which requires an
encoder to encode the source sequence and a decoder to generate the target
text. Recently, a number of new approaches have emerged that apply decoder-only
language models directly to the seq2seq task. Despite the significant
advancements in applying language models to the seq2seq task, there is still a
lack of thorough analysis on the effectiveness of the decoder-only language
model architecture. This paper aims to address this gap by conducting a
detailed comparison between the encoder-decoder architecture and the
decoder-only language model framework through the analysis of a regularized
encoder-decoder structure. This structure is designed to replicate all
behaviors in the classical decoder-only language model but has an encoder and a
decoder, making it easier to compare with the classical encoder-decoder
structure. Based on the analysis, we unveil the attention degeneration problem
in the language model, namely, as the number of generation steps grows, less and
less attention is focused on the source sequence. To give a quantitative
understanding of this problem, we conduct a theoretical sensitivity analysis of
the attention output with respect to the source input. Grounded on our
analysis, we propose a novel partial attention language model to solve the
attention degeneration problem. Experimental results on machine translation,
summarization, and data-to-text generation tasks support our analysis and
demonstrate the effectiveness of our proposed model.
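A toy sketch of the attention degeneration effect described above: with a fixed-length source and a growing number of generated tokens, the share of attention mass that the last decoding step places on the source shrinks. Random queries and keys stand in for a trained decoder-only model, so this is an illustration, not the paper's sensitivity analysis:

```python
import torch

def source_attention_share(source_len, target_len, d=64, seed=0):
    """Fraction of single-head attention mass that the last target position
    places on the source tokens, using random queries/keys as a stand-in
    for a trained decoder-only model."""
    g = torch.Generator().manual_seed(seed)
    q = torch.randn(1, d, generator=g)                        # last-step query
    k = torch.randn(source_len + target_len, d, generator=g)  # all keys so far
    attn = torch.softmax(q @ k.T / d ** 0.5, dim=-1)          # (1, src + tgt)
    return attn[0, :source_len].sum().item()

for t in (1, 10, 100, 1000):
    share = source_attention_share(source_len=20, target_len=t)
    print(f"target length {t:4d}: attention mass on source ~ {share:.2f}")
```

Because nothing privileges the source tokens here, their share decays roughly like source_len / (source_len + target_len) as generation proceeds, which is the degeneration a partial-attention design aims to counteract.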